Natural language processing to identify lupus nephritis phenotype in electronic health records

Background Systemic lupus erythematosus (SLE) is a rare autoimmune disorder characterized by an unpredictable course of flares and remission with diverse manifestations. Lupus nephritis, one of the major manifestations of SLE contributing to organ damage and mortality, is a key component of lupus classification criteria. Accurately identifying lupus nephritis in electronic health records (EHRs) would therefore benefit large cohort observational studies and clinical trials where characterization of the patient population is critical for recruitment, study design, and analysis. Lupus nephritis can be recognized through procedure codes and structured data, such as laboratory tests. However, other critical information documenting lupus nephritis, such as histologic reports from kidney biopsies and prior medical history narratives, requires sophisticated text processing to mine information from pathology reports and clinical notes. In this study, we developed algorithms to identify lupus nephritis with and without natural language processing (NLP) using EHR data from the Northwestern Medicine Enterprise Data Warehouse (NMEDW).
Methods We developed five algorithms: a rule-based algorithm using only structured data (baseline algorithm) and four algorithms using different NLP models. The first NLP model combined a simple regular expression keyword search with structured data. The other three NLP models were based on regularized logistic regression and used different sets of features: positive mentions of concept unique identifiers (CUIs), the number of appearances of CUIs, and a mixture of three components (i.e., a curated list of CUIs, regular expression concepts, and structured data), respectively. The baseline algorithm and the best performing NLP algorithm were externally validated on a dataset from Vanderbilt University Medical Center (VUMC).
Results Our best performing NLP model incorporated features from structured data, regular expression concepts, and mapped CUIs, and improved the F measure over the baseline lupus nephritis algorithm in both the NMEDW (0.41 vs 0.79) and VUMC (0.52 vs 0.93) datasets.
Conclusion Our NLP MetaMap mixed model improved the F measure greatly compared to the structured data only algorithm in both internal and external validation datasets. The NLP algorithms can serve as powerful tools to accurately identify the lupus nephritis phenotype in EHRs for clinical research and better targeted therapies.
Supplementary Information The online version contains supplementary material available at 10.1186/s12911-024-02420-7.


Introduction
Systemic lupus erythematosus (SLE) is an autoimmune disease that has diverse manifestations, resulting in significant morbidity and mortality [1,2]. While many autoimmune diseases, such as rheumatoid arthritis, have benefitted from new classes of medications, SLE has seen few advancements in therapy in the last 50 years [3]. It has been hypothesized that the heterogeneity of SLE presentations may make it challenging to understand therapeutic responses across the full scope of SLE presentations and that observational cohort studies and clinical trials would benefit from targeting subpopulations with similar disease presentations [4]. Recently, the Food and Drug Administration approved two new medications for use in managing lupus nephritis, increasing the urgency of identifying lupus nephritis in people with SLE so that the new therapeutics can be targeted to these patients to help reduce kidney damage and improve long-term outcomes [5]. Classification criteria for SLE describe a broad range of evidence-based clinical and laboratory descriptors. There are three criteria sets currently in use: 1) the set developed in 1982 and revised in 1997 by the American College of Rheumatology (ACR) [6], 2) the set developed by the Systemic Lupus International Collaborating Clinics (SLICC) in 2012 [2], and 3) the set developed by the European League Against Rheumatism/American College of Rheumatology (EULAR/ACR) [7]. Lupus nephritis is one of the most common and severe manifestations of SLE. Approximately 40% of SLE patients develop lupus nephritis [8], and it is included in all three classification criteria sets. In both the SLICC and EULAR/ACR criteria, one way to be classified as "definite lupus" is having a positive anti-nuclear antibody/anti-dsDNA screen in the presence of renal biopsy-proven lupus nephritis [2,7]. Thus, lupus nephritis is a critical attribute to describe for clinical and research applications and for the identification of SLE subpopulations, but identifying patients who satisfy this criterion often requires time-consuming chart adjudication.
Electronic health records (EHRs) are a readily available data source that includes a record of clinical care and procedures, diagnoses, laboratory test results, medication orders, and clinical notes for describing disease manifestations in persons with SLE. EHRs have proven useful in genome association studies, drug comparative effectiveness studies [9,10], and other research. However, a large amount of information in the EHR, such as histology notes for kidney biopsies, is generally located only in text-based notes, from which it is challenging to extract information using simple rule-based identification algorithms and text string searches [11,12]. Several prior studies developed algorithms to identify lupus nephritis using administrative claims data [13]. Chibnik et al. identified lupus nephritis in claims data and reached a positive predictive value (PPV) of 88%, but sensitivity and specificity were not reported [14]. Li et al. used various combinations of International Classification of Diseases (ICD) codes to identify lupus nephritis [15]. Their algorithm achieved good sensitivity and specificity but a low PPV of 63.4%. Most of these studies used only structured data (i.e., ICD codes, laboratory test values), and the algorithms were often not validated in an external dataset [14,15]. Thus, correctly identifying lupus nephritis from the EHR for large cohort studies requires, in addition to identifying critical procedures, diagnoses, and lab results, the development of natural language processing (NLP) tools that can utilize histology reports and clinical notes. Previous studies of other phenotypes conventionally identified from structured data (e.g., multiple sclerosis, rheumatoid arthritis) have demonstrated that NLP can significantly improve the rate of identification [11,16].
In this study, we focus on the identification of lupus nephritis as defined in the SLICC criteria in EHR data, using NLP technologies to mine clinical notes and pathology reports. To do this, we compared an algorithm for the identification of lupus nephritis based on structured data alone to four different NLP models to determine whether NLP could improve identification of persons with lupus nephritis. Our approach facilitates accurate identification of lupus nephritis in the EHR, enabling researchers to better understand patients' SLE characteristics and serving as a foundation for lupus nephritis-related large cohort observational studies and clinical trials. We developed and evaluated the algorithms in a dataset from the Northwestern Medicine Enterprise Data Warehouse (NMEDW) and then further validated the baseline algorithm and the best performing NLP model in an external dataset from Vanderbilt University Medical Center (VUMC).

Data source
The Chicago Lupus Database (CLD), established in 1991, is a registry database specifically designed for lupus-related studies. It is a physician-validated registry of 1052 patients with possible or definite lupus according to the 1982 American College of Rheumatology classification criteria revised in 1997 [17,18]. The patients in the CLD met at least three ACR criteria (step 1 in Fig. 1). Among the 1052 patients in the CLD, 878 patients had definite lupus according to the Systemic Lupus International Collaborating Clinics (SLICC) classification criteria (step 2 in Fig. 1) [2]. Among these patients, 178 have lupus nephritis according to the definition in SLICC. The presence or absence of lupus nephritis in patients in the CLD is verified by physician chart review.
The Northwestern Medicine Enterprise Data Warehouse (NMEDW) is the primary data repository for all the medical records of patients who receive care within the Northwestern Medicine system [19]. Established in 2007, the NMEDW contains records for over 3.8 million patients, with most EHR data going back to at least 2002 and some billing claims data going back to 1998 or earlier. By linking patients in the CLD to patient records in the NMEDW through their medical record numbers, we identified 818 definite SLE patients based on SLICC criteria who were in both the CLD and the NMEDW (see step 3 in Fig. 1). To ensure that our patient cohort had sufficient depth of data in both data sources, we excluded any patients who had fewer than four clinical encounters documented in the NMEDW [20,21], reducing the final case cohort size to 472 (see step 4 in Fig. 1). All inpatient and outpatient notes from the transplant, nephrology, and rheumatology departments were retrieved. The retrieved clinical narratives included pathology reports, progress notes, consult notes, and discharge notes.
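The linkage and encounter filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_cohort`, the toy medical record numbers, and the flat list of encounter records are hypothetical and do not reflect the actual NMEDW schema or query.

```python
from collections import Counter

def select_cohort(registry_mrns, encounter_mrns, min_encounters=4):
    """Keep registry patients with at least `min_encounters` warehouse encounters.

    registry_mrns: medical record numbers of registry patients (hypothetical input)
    encounter_mrns: one medical record number per documented encounter (hypothetical input)
    """
    counts = Counter(encounter_mrns)  # encounters per patient in the warehouse
    # Patients absent from the warehouse have a count of 0 and are dropped.
    return sorted(mrn for mrn in registry_mrns if counts[mrn] >= min_encounters)

# Toy data: "A" has 5 encounters, "B" has 3, "C" is not in the warehouse,
# and "D" is in the warehouse but not in the registry.
registry = ["A", "B", "C"]
encounters = ["A"] * 5 + ["B"] * 3 + ["D"] * 9
cohort = select_cohort(registry, encounters)
print(cohort)
```

Only patient "A" survives both the linkage step and the four-encounter threshold in this toy example.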

Algorithm development
In this study, we focus on identifying the renal criterion (lupus nephritis) in the SLICC classification, which is defined as "having a urine protein/creatinine ratio (or 24-h urine protein collection) equivalent to 500 mg of protein per 24-h period, or red blood cell casts in the urine" [2]. The SLICC renal criterion includes both biopsy-proven and non-biopsy-proven nephritis. To set up the gold standard label for lupus nephritis, the physicians on our team, who are expert clinicians in lupus, performed chart review using data from the CLD, which has more in-depth lupus-related information than EHR data. The physicians also excluded other causes of glomerular disease when adjudicating the diagnosis of lupus nephritis.
Fig. 1 SLE case cohort selection process. We identified 1052 SLE patients who met at least 3 ACR criteria based on physician chart review. Among these 1052 patients, we further identified 878 patients who also met SLICC classification criteria. Among the 878 patients, 818 patients were in the NMEDW. We further restricted our study cohort to patients who had at least 4 encounters in the NMEDW, which left 472 patients in the final cohort. Abbreviations: ACR criteria, American College of Rheumatology Classification Criteria; CLD, Chicago Lupus Database; SLE, systemic lupus erythematosus; NMEDW, Northwestern Medicine Enterprise Data Warehouse
We developed five algorithms (see Table 1 for an overview) to identify lupus nephritis from SLE patients' EHR data: a baseline algorithm that used only structured data and four NLP models that used structured data and clinical notes. In the baseline algorithm, a patient is classified as having lupus nephritis based on ICD-9/10 diagnosis codes and laboratory test results. The details of the structured data used in the baseline algorithm are shown in Additional file 1: Table S1. For the NLP models, following the steps in Zeng et al. [22][23][24], we extracted different feature sets for model implementation, including concept unique identifier (CUI) features and regular expression (regex) matches from the notes. For the CUI features, we first preprocessed the notes by removing duplicated records and tokenizing sentences. We then applied MetaMap to annotate medical concepts in each sentence [25]. MetaMap is an NLP application that maps biomedical text to the Unified Medical Language System (UMLS) Metathesaurus and assigns a CUI to each word or term [26]. Any CUIs recognized as being negated by MetaMap (e.g., "no glomerulonephritis") were excluded. For the regex features, five concepts were used: nephritis class II, nephritis class III, nephritis class IV, nephritis class V, and proteinuria. We developed regular expression patterns to search for text related to the five concepts (see Additional file 1: Table S1 for the list of regex patterns).
We built four NLP models using different feature sets. In the first NLP model, we implemented a rule-based algorithm using both regex features and structured data. A patient is classified as having lupus nephritis if they have any match for the regex patterns, ICD-9/10 codes, or laboratory tests of interest (see Additional file 1: Tables S1 and S2). For the other three NLP models, we implemented an L2-regularized logistic regression classifier. We chose L2-regularized logistic regression because it can handle a high-dimensional feature space and multicollinearity by penalizing its coefficients in the loss function. In addition, the model is straightforward, and its output is easy to interpret. We tried both L1- and L2-regularized logistic regression and selected the latter because it generated equivalent if not superior performance compared to L1-regularized logistic regression in our Northwestern University (NU) dataset.
In the first L2-regularized logistic regression-based NLP model, the full MetaMap (binary) model, all positively mentioned MetaMap CUIs were used as binary features. In the second, the full MetaMap (count) model, the number of occurrences of every positively mapped CUI was used as a feature. The minimum document frequency was set to 30 and 40 in the MetaMap (binary) model and the MetaMap (count) model, respectively, to avoid feature sparsity. These thresholds were chosen by trying a list of candidate frequencies and selecting the ones that generated the highest F measure. In the last L2-regularized logistic regression-based NLP model, the MetaMap mixed model, we used a mixture of lupus nephritis-related CUIs, structured data, and regex concepts as features. The CUIs include C0024143, C0268757, C0268758, C4053955, C4053958, C4053959, and C4054543 (see Additional file 1: Table S3 for each CUI definition). For the structured data component, a single binary feature is used: a patient is positive for the structured data feature if they are predicted positive by the baseline algorithm. There were 13 variables in total in the MetaMap mixed model: 7 CUI features, 5 lupus nephritis-related regex concepts, and 1 feature from structured data.
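To make the 13-variable layout of the MetaMap mixed model concrete, the following Python sketch assembles one patient's feature vector. It rests on several assumptions: the regex patterns are simplified stand-ins (the study's actual patterns are listed in Additional file 1), the function names are hypothetical, and the set of non-negated CUIs is assumed to have been produced upstream by MetaMap.

```python
import re

# The seven curated CUIs named in the text.
CURATED_CUIS = ["C0024143", "C0268757", "C0268758", "C4053955",
                "C4053958", "C4053959", "C4054543"]

# Hypothetical, simplified patterns for the five regex concepts.
REGEX_CONCEPTS = {
    "class_II": re.compile(r"\bnephritis\s+class\s+(?:II|2)\b", re.IGNORECASE),
    "class_III": re.compile(r"\bnephritis\s+class\s+(?:III|3)\b", re.IGNORECASE),
    "class_IV": re.compile(r"\bnephritis\s+class\s+(?:IV|4)\b", re.IGNORECASE),
    "class_V": re.compile(r"\bnephritis\s+class\s+(?:V|5)\b", re.IGNORECASE),
    "proteinuria": re.compile(r"\bproteinuria\b", re.IGNORECASE),
}

def regex_features(notes):
    """Return 0/1 flags, one per regex concept, over all of a patient's notes."""
    return [int(any(p.search(n) for n in notes)) for p in REGEX_CONCEPTS.values()]

def mixed_features(positive_cuis, notes, baseline_positive):
    """Build the 13-element vector: 7 CUI flags, 5 regex flags, 1 structured flag."""
    cui_flags = [int(c in positive_cuis) for c in CURATED_CUIS]
    return cui_flags + regex_features(notes) + [int(baseline_positive)]

def rule_based_positive(notes, baseline_positive):
    """First NLP model: positive on any regex match or any structured-data hit."""
    return bool(baseline_positive) or any(regex_features(notes))

notes = ["Biopsy showed lupus nephritis class IV with proteinuria."]
vector = mixed_features({"C0024143"}, notes, baseline_positive=True)
print(vector)
```

For the toy note above, the vector flags C0024143, the class IV concept, proteinuria, and the structured data indicator, and leaves the other nine positions at zero.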

Model training and evaluation
We split the data from the NMEDW into training (75%) and testing (25%) datasets. In the training dataset, to find the optimal hyperparameter, we used a grid search on parameter C, the inverse of regularization strength, ranging from 1e-5 to 1e5 with successive values spaced by a factor of 10. For the L2-regularized logistic regression-based NLP models, we selected the "sag" (stochastic average gradient) method as our optimizer [27]. We set the class weight to balanced to adjust for disproportionate class frequencies. Parameters that generated the best accuracy were retained. We evaluated our models in the testing set (internal validation) based on sensitivity, specificity, PPV, negative predictive value (NPV), F measure, and area under the curve (AUC). We further explored feature contribution by extracting the top 5 features with the highest positive coefficients in the MetaMap (binary), MetaMap (count), and MetaMap mixed models, respectively. We also evaluated feature importance by generating mean absolute Shapley value (SHAP) plots. L2-regularized logistic regression was conducted using the 'scikit-learn' library in Python, version 3.7.3. Regular expression matching was performed using the 're' package in Python, version 3.7.3 [27,28]. Shapley values were generated using the 'shap' package in Python, version 3.7.3. SHAP plots were generated using the 'matplotlib' package in Python, version 3.7.3.
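The hyperparameter grid and the balanced class weighting can be illustrated with plain Python. This sketch assumes the grid is logarithmic (eleven values, each ten times the previous one) and uses the weighting formula documented for scikit-learn's "balanced" option, n_samples / (n_classes * class_count). The class counts below are the whole-cohort figures (178 with and 294 without lupus nephritis), used purely for illustration; the actual training split counts are not given in the text.

```python
from collections import Counter

def c_grid(low_exp=-5, high_exp=5):
    """Candidate values of C from 1e-5 to 1e5, assuming factor-of-10 spacing."""
    return [10.0 ** k for k in range(low_exp, high_exp + 1)]

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

grid = c_grid()
# Illustrative labels only: 1 = lupus nephritis, 0 = no lupus nephritis.
labels = [1] * 178 + [0] * 294
weights = balanced_class_weights(labels)
print(len(grid), weights)
```

In scikit-learn, these settings would correspond to `LogisticRegression(penalty="l2", solver="sag", class_weight="balanced")` with `C` chosen from the grid; the minority class receives the larger weight, so misclassifying a lupus nephritis patient costs more during training.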

External validation
We further validated both the baseline algorithm and the best performing NLP model (based on results from the testing set at the Northwestern University site) in an external validation dataset at Vanderbilt University Medical Center (VUMC), a regional, tertiary care center [29,30]. The VUMC data warehouse contains over 3.2 million subjects with de-identified clinical records from the EHR collected across the past several decades. We first applied a simple SLE phenotyping algorithm based on SLE ICD-9/10 codes to obtain an SLE cohort (not chart reviewed) on which to run our lupus nephritis algorithm. We then randomly selected 75 patients on which to evaluate our lupus nephritis algorithm. A rheumatologist manually reviewed the charts of these 75 patients. Among these patients, there were 18 patients with definite lupus, 1 with possible SLE, and 56 with no SLE. Among these 75 patients, there were 14 patients with lupus nephritis, all of whom had definite lupus, and 61 patients without. We evaluated the F measure, sensitivity, specificity, PPV, and NPV for the lupus nephritis baseline algorithm and for the NLP model with the highest F measure based on the results from the Northwestern University (NU) dataset. The F measure summarizes the accuracy of the algorithm and is calculated as follows:
$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
Here precision and recall are also known as PPV and sensitivity, respectively.
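As a worked illustration of the evaluation metrics named above, the sketch below derives them from the four cells of a confusion matrix. The F measure is computed as the harmonic mean of precision (PPV) and recall (sensitivity), which is the standard definition and is assumed here. The counts in the example are hypothetical and are not results from either dataset.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, PPV, NPV, and F measure from counts."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_measure": f_measure,
    }

# Hypothetical counts chosen for easy arithmetic, not study results.
metrics = classification_metrics(tp=8, fp=2, tn=88, fn=2)
print(metrics)
```

With these toy counts, precision and recall are both 8/10, so the F measure is also 0.8; when the two differ, the harmonic mean is pulled toward the lower value.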

Results
Among the 472 SLE patients at NU, there were 178 patients (37.7% of the cohort) who developed lupus nephritis. The average number of notes per patient was 68.58 (standard deviation [SD] = 59.37). The distribution of the number of notes for the patient cohort is shown in Fig. 2. Out of the 472 patients, 206 had ICD codes related to lupus nephritis, 4 had red blood cell cast test results, and 230 had urine protein test results available.
The performance of the five algorithms is shown in Table 2. All four NLP models had higher sensitivity, specificity, PPV, and NPV than the baseline algorithm using structured data alone. All the logistic regression-based NLP models had a higher F measure than the rule-based NLP model using structured data and regex patterns. The full MetaMap (binary)    3). Therefore, we selected the MetaMap mixed model as the final NLP model to be validated at VUMC in addition to the baseline algorithm.
In the VUMC dataset, which included 75 patients, the MetaMap mixed model had higher sensitivity, specificity, PPV, and NPV than the baseline algorithm. The F measure improved from 0.52 to 0.93, as shown in Table 2.
In terms of feature importance, the top 5 features with the highest positive coefficients for each classifier are shown in Table 3. C0024143 (lupus nephritis) has a high positive coefficient in all three L2-regularized classifiers. C1962972 (proteinuria finding) has the 4th highest positive coefficient in both the MetaMap (binary) and MetaMap (count) models. Our full MetaMap models were able to pick up many important lupus nephritis-related concepts, such as kidney disease, proteinuria, and lupus nephritis, as high-coefficient features. The SHAP plots show the top 10 most important features for classification in each model. As shown in Figs. 4, 5 and 6, most of the important features are clinically related to lupus nephritis.
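The global importance reported in the SHAP plots is the mean absolute Shapley value of each feature over all samples. The sketch below shows only that aggregation step in plain Python; the per-sample Shapley values are hypothetical numbers, and in practice they would come from the 'shap' package rather than being typed in by hand.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute Shapley value across samples.

    shap_rows: one list of per-feature Shapley values per sample (hypothetical input)
    feature_names: names aligned with the columns of shap_rows
    """
    n_samples = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(feature_names)
    }
    # Highest global importance first, as in a SHAP bar plot.
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)

# Hypothetical Shapley values for two samples and two features.
names = ["C0024143", "C0033687"]
rows = [[0.5, -0.1],
        [-0.3, 0.1]]
ranking = mean_abs_shap(rows, names)
print(ranking)
```

Taking the absolute value before averaging means a feature that pushes predictions strongly in either direction ranks as important, even if its signed contributions would cancel out.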

Discussion
In this study, we developed five algorithms to identify lupus nephritis: a baseline algorithm using structured data only, a rule-based model using regex and structured data, a full MetaMap model with binary features, a full MetaMap model with count features, and a MetaMap mixed model. In the NU testing dataset, the MetaMap mixed model (F measure = 0.79) outperformed both the baseline algorithm (F measure = 0.41) and the other two NLP models (F measure = 0.71 and 0.71, respectively). In the VUMC validation dataset, the MetaMap mixed model substantially improved the F measure over the baseline algorithm (0.93 versus 0.52).

Error analysis
In the MetaMap mixed model, we investigated 10 SLE patients in the training set who were wrongly classified by L2-regularized logistic regression. One patient was wrongly predicted as negative for lupus nephritis, with a predicted probability of 0.49 of having lupus nephritis. In the feature set the algorithm identified, the patient was positive for CUI C0024143 (glomerulonephritis in the context of systemic lupus erythematosus) and negative for all the other features. The notes mentioned that the patient had 'stage 2 LN'. Lupus nephritis class II is one of the features used in our algorithm. However, our regex did not include this specific wording variation for lupus nephritis class II. This pattern could be incorporated into the NLP model in the future to improve algorithm performance.
In another example, a 26-year-old female was wrongly predicted as positive for lupus nephritis, with a predicted probability of 0.53 of having lupus nephritis. In the feature set the algorithm identified, the patient was positive for C0024143 (glomerulonephritis in the context of systemic lupus erythematosus) and for the proteinuria feature, both of which were positively associated with lupus nephritis. Our algorithm matched 'proteinuria > 0.5' in the notes, which appeared in the context of 'negative renal disorder: either persistent proteinuria (> 0.5 g/day or +++) or cellular casts'. Our regex pattern was not able to capture the negation at the beginning of the sentence. Therefore, the model falsely predicted the patient as positive for lupus nephritis.
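Both error types described above suggest pattern refinements, which can be sketched as follows. Everything in this block is hypothetical: the patterns are not the study's regex patterns, the negation cue list is a deliberately small illustration, and a simple check for a cue earlier in the sentence is a crude stand-in for proper negation detection.

```python
import re

# Hypothetical class II pattern that also accepts the 'stage 2 LN' wording.
CLASS_II = re.compile(
    r"\b(?:lupus\s+nephritis|LN)\s+(?:class|stage)\s+(?:II|2)\b"
    r"|\b(?:class|stage)\s+(?:II|2)\s+(?:lupus\s+nephritis|LN)\b",
    re.IGNORECASE,
)

# Hypothetical proteinuria pattern and a minimal list of negation cues.
PROTEINURIA = re.compile(r"proteinuria\s*\(?\s*>\s*0\.5", re.IGNORECASE)
NEGATION_CUE = re.compile(r"\b(?:negative|no|denies|without)\b", re.IGNORECASE)

def proteinuria_positive(sentence):
    """True if proteinuria > 0.5 is mentioned with no negation cue before it."""
    match = PROTEINURIA.search(sentence)
    if match is None:
        return False
    # Look for a negation cue anywhere in the sentence before the match.
    return NEGATION_CUE.search(sentence[:match.start()]) is None

negated = ("negative renal disorder: either persistent proteinuria "
           "(> 0.5 g/day or +++) or cellular casts")
print(bool(CLASS_II.search("patient had stage 2 LN")))
print(proteinuria_positive(negated))
print(proteinuria_positive("persistent proteinuria > 0.5 g/day"))
```

The first call matches the wording variant that the original class II pattern missed, and the second rejects the negated criteria-checklist sentence while the third still accepts an affirmative mention.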
All NLP models outperformed the baseline algorithm in the NU testing (internal validation) dataset. In the baseline model, 20/35 lupus nephritis patients were wrongly classified as non-lupus nephritis patients, while the MetaMap mixed model reduced the misclassified cases to 9/35. The baseline algorithm relies solely on ICD-9/10 diagnosis codes and laboratory test results, and its performance was largely influenced by laboratory tests missing from the EHR. In the NLP MetaMap mixed model, combining features from multiple modalities that complement each other (EHR note-derived regex and CUI features, laboratory tests, and ICD codes) with a penalized logistic regression model improved the accuracy and generalizability of the model. As part of future work, we plan to apply advanced imputation methods [31,32] to fill in missing laboratory tests in order to further improve the phenotyping performance.
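The misclassification counts above translate directly into sensitivity, since every lupus nephritis patient wrongly labeled as negative is a false negative. The short sketch below performs that arithmetic, assuming the 35 patients are all the lupus nephritis cases in the testing set; the function name is hypothetical.

```python
def sensitivity_from_misses(false_negatives, positives):
    """Sensitivity implied by the number of missed cases among true positives."""
    return (positives - false_negatives) / positives

baseline_sensitivity = sensitivity_from_misses(20, 35)  # 15 of 35 detected
mixed_sensitivity = sensitivity_from_misses(9, 35)      # 26 of 35 detected
print(round(baseline_sensitivity, 3), round(mixed_sensitivity, 3))
```

Under that assumption, the baseline algorithm detects 15 of 35 cases and the MetaMap mixed model detects 26 of 35.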

Limitations
Our study has certain limitations. Firstly, even though we used physician adjudication to set up the gold standard label, the label could still be imperfect in cases where a biopsy or other information needed to establish the true lupus nephritis status was lacking. However, this impact was minimized by using the CLD registry as the data source, which has more in-depth lupus-related information than EHR data and helps set up the gold standard label more accurately. Secondly, we only had 75 patients in the VUMC validation dataset because of limited resources for chart review. The prevalence of lupus nephritis is high among the lupus population in the external VUMC dataset. This is likely attributable to VUMC being a tertiary care center that sees sicker patients, and to the small sample size, which may increase the chance of sampling bias. Future studies are needed to further validate our algorithm performance in a larger external dataset.

Conclusion
In conclusion, we developed five algorithms, a structured data only algorithm and four NLP models, to identify lupus nephritis phenotypes. We evaluated the algorithms in an internal and an external validation dataset. All four NLP models outperformed the baseline algorithm in the internal validation dataset. In the external validation dataset, our NLP MetaMap mixed model improved the F measure greatly compared to the structured data only algorithm. Our NLP algorithms can serve as powerful tools to accurately identify the lupus nephritis phenotype in EHRs for clinical research and better targeted therapies.

Fig. 3 Area under the curve (AUC) for Full MetaMap (binary), Full MetaMap (counts), and MetaMap mixed model in NU testing set

Fig. 4 SHAP plot for the full MetaMap (binary) model with SHAP feature importance measured as the mean absolute Shapley values. The features are ordered according to their importance. The SHAP bar plot shows the global importance of each feature, which is taken to be the mean absolute SHAP value for that feature over all the given samples. The most important feature in this plot is C0027697, which has a global importance of 0.083 compared to an average feature global importance of 0.006. C0027697: nephritis; C0024143: lupus nephritis; C0194073: kidney biopsy; C0033687: proteinuria; C0022658: kidney diseases; C1318439: urine creatinine measurement; C045555: H/O: nephritis; C0428283: urine creatinine level finding; C0262923: urine protein test

Fig. SHAP plot for the MetaMap mixed model with SHAP feature importance measured as the mean absolute Shapley values. The features are ordered according to their importance. The SHAP bar plot shows the global importance of each feature, which is taken to be the mean absolute SHAP value for that feature over all the given samples. The most important feature in this plot is renal_C0024143, which has a global importance of 0.916 compared to an average feature global importance of 0.192. RENAL: renal indicator from structured data; renal_C4054543: membranous lupus nephritis; renal_C0268758: SLE glomerulonephritis syndrome, WHO class V; renal_C4053955: systemic lupus erythematosus nephritis class IV; renal_C4053959: systemic lupus erythematosus nephritis class III

Table 1
Algorithm description

Table 2
Model performance. For logistic regression-based models, a probability of 0.5 is used as the threshold for classification. Abbreviations: SLE, systemic lupus erythematosus; NU, Northwestern University; VUMC, Vanderbilt University Medical Center; NLP, natural language processing; PPV, positive predictive value; NPV, negative predictive value

Table 3
Top 5 positive coefficients for each classifier. Features are ranked by the value of their associated coefficients. Abbreviations: RENAL, renal indicator from structured data only; gm, gram